Electronic apparatus and controlling method thereof

ABSTRACT

An electronic apparatus is provided. The electronic apparatus includes a memory and a processor configured to control the electronic apparatus to: classify a plurality of input data into a plurality of types to store in the memory, determine at least one among the input data of the classified plurality of types based on a voice command being recognized among the input data, and provide response information corresponding to the voice command based on the input data of the determined type.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a Continuation of U.S. application Ser. No. 16/671,518, filed Nov. 1, 2019, which claims priority to KR10-2018-0133827, filed on Nov. 2, 2018, the entire contents of which are all hereby incorporated herein by reference in their entireties.

BACKGROUND

1. Field

The disclosure relates to an electronic apparatus and a controlling method thereof and, for example, to an electronic apparatus that provides response information corresponding to a user voice command from input data using an artificial intelligence learning model, and a controlling method thereof.

2. Description of Related Art

In order to perform an operation corresponding to a conventional user voice command, input data is compared with existing data. For example, a recognized user's voice is compared with previously stored data, or a recognized user's behavior is compared with previously stored data. However, there may be cases in which the user voice command is not clearly understood only by comparing with the previously stored data.

Accordingly, in a process of recognizing user voice commands, various types of data such as gestures, emotions, etc. in addition to voice may be newly analyzed in order to specifically identify user voice commands. However, there is a problem in that recognition takes a lot of time when all of the various types of data are used. In addition, it is difficult to determine which data to use.

SUMMARY

Embodiments of the present disclosure address the problem described above and provide an electronic apparatus that classifies input data by type, determines input data of a specific type from the classified types using an artificial intelligence learning model, and provides information corresponding to a user voice command, and a controlling method thereof.

An example aspect of example embodiments relates to an electronic apparatus including a memory and a processor configured to control the electronic apparatus to: classify a plurality of input data into a plurality of types to store in the memory, determine at least one among the input data of the classified plurality of types based on a voice command being recognized among the input data, and provide response information corresponding to the voice command based on the input data of the determined type.

The processor may determine at least one among the input data of the classified plurality of types based on time information related to the user voice command.

The processor may group the input data classified into the plurality of types by a preset time unit, obtain representative data for each of the plurality of types corresponding to each time unit based on the grouped input data to store in the memory, and provide response information corresponding to the voice command based on the representative data of the determined type.

The processor may compare an amount of change in the representative data for each of the plurality of types corresponding to each time unit, and assign a largest weight to a type having a largest amount of change to provide response information corresponding to the voice command.

The plurality of types may include at least one of gesture information, emotion information, face recognition information, gender information, age information or voice information.

The processor may recognize the voice command based on at least one of the gesture information or voice information among the input data.

The processor may, based on the voice command being recognized in the input data, recognize the voice command as a preset voice recognition unit, and determine at least one among the classified plurality of types based on a time interval belonging to at least one voice recognition unit.

The processor may, based on a wake-up word being included in the voice command, recognize the voice command as the preset voice recognition unit based on the time interval where the wake-up word is recognized.

The processor may, based on the response information not being provided based on the input data input for a preset time interval after the wake-up word is recognized, provide response information corresponding to the voice command using input data input in a previous time interval before the wake-up word is recognized.

The processor may determine at least one among the input data of the classified plurality of types based on information on the user's intention or an object to be controlled which are recognized in the voice command.

An example aspect of example embodiments relates to a method for controlling an electronic apparatus including classifying a plurality of input data into a plurality of types to store in a memory, based on a voice command being recognized among the input data, determining at least one among the input data of the classified plurality of types, and providing response information corresponding to the voice command based on the input data of the determined type.

The determining at least one among the input data of the classified plurality of types may include determining at least one among the input data of the classified plurality of types based on time information related to the user voice command.

The storing in the memory may include grouping the input data classified into the plurality of types by a preset time unit and obtaining representative data for each of the plurality of types corresponding to each time unit based on the grouped input data to store in the memory, wherein the providing response information corresponding to the voice command includes providing response information corresponding to the voice command based on the representative data of the determined type.

The providing the response information corresponding to the voice command may include comparing an amount of change in the representative value for each of the plurality of types corresponding to each time unit, and assigning a largest weight to a type having a largest amount of change to provide response information corresponding to the voice command.

The plurality of types may include at least one of gesture information, emotion information, face recognition information, gender information, age information or voice information.

The determining at least one among the input data of the classified plurality of types may include recognizing the voice command based on at least one of the gesture information or the voice information among the input data.

The determining at least one among the input data of the classified plurality of types may include, based on a voice command being recognized in the input data, recognizing the voice command as a preset voice recognition unit, and determining at least one among the input data of the classified plurality of types based on a time interval belonging to at least one voice recognition unit.

The determining at least one among the input data of the classified plurality of types may include, based on a wake-up word being included in the user voice command, recognizing the user voice command as the preset voice recognition unit based on the time interval where the wake-up word is recognized.

The providing the response information corresponding to the voice command may include, based on the response information not being provided based on the input data input for a preset time interval after the wake-up word is recognized, providing the response information corresponding to the user voice command using the input data input in a previous time interval before the wake-up word is recognized.

The determining at least one among the input data of the classified plurality of types may include determining at least one among the input data of the classified plurality of types based on at least one of the user's intention or an object to be controlled which are recognized in the voice command.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other aspects, features and advantages of certain embodiments of the present disclosure will be more apparent from the following detailed description, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram illustrating an example electronic apparatus according to an embodiment of the disclosure;

FIG. 2 is a block diagram illustrating an example configuration of an example electronic apparatus of FIG. 1;

FIG. 3 is a diagram illustrating an example operation of classifying and storing input data according to types according to an embodiment of the disclosure;

FIGS. 4A, 4B, 4C and 4D are diagrams illustrating an example operation of storing input data by a unit of a predetermined time according to an embodiment of the disclosure;

FIG. 5 is a diagram illustrating an example process of classifying input data based on time information according to an embodiment of the disclosure;

FIG. 6 is a diagram illustrating an example in which contents stored in input data vary with time according to an embodiment of the disclosure;

FIG. 7 is a diagram illustrating an example operation of grouping input data into a predetermined time interval according to an embodiment of the disclosure;

FIG. 8 is a diagram illustrating an example process of obtaining a representative value of input data in a grouping process according to an embodiment of the disclosure;

FIG. 9 is a diagram illustrating an example operation of selecting some of a plurality of types according to preset data according to an embodiment of the disclosure;

FIGS. 10A, 10B, 10C and 10D are diagrams illustrating examples in which a plurality of types, which are selected in accordance with time intervals, are different according to an embodiment of the disclosure;

FIG. 11 is a diagram illustrating examples of providing information corresponding to a user voice command using voice or emotion data according to embodiments of the disclosure;

FIG. 12 is a diagram illustrating an example operation for storing input data temporarily and providing response information corresponding to a user voice command using the temporary input data according to an embodiment of the disclosure;

FIG. 13 is a diagram illustrating an example of providing response information corresponding to a user voice command by applying weights to a plurality of input data according to an embodiment of the disclosure;

FIG. 14 is a diagram illustrating an example operation of an electronic apparatus for each function according to an embodiment of the disclosure;

FIG. 15 is a diagram illustrating an example operation of an electronic apparatus by time according to an embodiment of the disclosure; and

FIG. 16 is a flowchart illustrating an example operation of an electronic apparatus according to an embodiment of the disclosure.

DETAILED DESCRIPTION

Before describing various example embodiments of the disclosure in greater detail, a method for understanding the present disclosure and drawings will be described.

The terms used in embodiments of the present disclosure are selected as general terminologies currently widely used in consideration of the configuration and functions of the present disclosure, but can be different depending on intention of those skilled in the art, a precedent, appearance of new technologies, and the like. Also, there may be some arbitrarily selected terms. Such terms may be understood according to meanings defined in the present disclosure, and may also be understood based on general contents of the present disclosure and a typical technical concept in the art where the terms are not specifically defined.

Also, the same reference numerals or symbols described in the attached drawings may denote parts or elements that actually perform the same functions. For convenience of descriptions and understanding, the same reference numerals or symbols are used and described in different example embodiments. In other words, although elements having the same reference numerals are all illustrated in a plurality of drawings, the plurality of drawings do not necessarily refer to only one example embodiment.

In addition, in order to distinguish between the components, terms including an ordinal number such as “first”, “second”, etc. may be used in the present disclosure and claims. The ordinal numbers may be used to distinguish the same or similar elements from one another, and the use of the ordinal number should not be understood as limiting. The terms used herein are solely intended to explain various example embodiments, and not to limit the scope of the present disclosure. For example, usage orders, arrangement orders, or the like of elements that are combined with these ordinal numbers may not be limited by the numbers. The respective ordinal numbers are interchangeably used, if necessary.

The singular expression also includes the plural meaning as long as it does not conflict with the context. The terms “include”, “comprise”, “is configured to,” etc. of the description may be used to indicate that there are features, numbers, steps, operations, elements, parts or a combination thereof, and they should not exclude the possibilities of combination or addition of one or more features, numbers, steps, operations, elements, parts or a combination thereof.

The present disclosure may have several embodiments, and the embodiments may be modified variously. In the following description, specific embodiments are provided with accompanying drawings and more detailed descriptions thereof. However, this does not necessarily limit the scope of the example embodiments to a specific embodiment form. Instead, modifications, equivalents and replacements included in the disclosed concept and technical scope of this disclosure may be employed. While describing example embodiments, if it is determined that the specific description regarding a known technology obscures the gist of the disclosure, the specific description may be omitted.

In the example embodiments of the present disclosure, the term “module,” “unit,” or “part” may refer to an element that performs at least one function or operation, and may be implemented with hardware, software, or a combination of hardware and software. In addition, a plurality of “modules,” a plurality of “units,” or a plurality of “parts” may be integrated into at least one module or chip, except for a “module,” a “unit,” or a “part” which has to be implemented with specific hardware, and may be implemented with at least one processor (not shown).

Also, when any part is connected to another part, this includes a direct connection and an indirect connection through another medium. Further, when a certain portion includes a certain element, unless specified to the contrary, another element may be additionally included, rather than precluding another element.

Various portions of various example embodiments of the disclosure may be performed by a machine learning based recognition system, and the disclosure may include a classification system based on a series of machine learning algorithms based on neural networks; a deep learning based recognition system will be described below as an example.

The deep learning based recognition system may include at least one classifier, and the classifier may correspond to one or a plurality of processors. The processor may be realized as an array of a plurality of logic gates, or may be implemented as a combination of a general microprocessor and a memory in which a program that can be executed by the microprocessor is stored.

The classifier may be implemented as a neural network based classifier, a support vector machine (SVM), an AdaBoost classifier, a Bayesian classifier, a perceptron classifier, or the like. Hereinafter, a classifier of the disclosure may refer, for example, to an embodiment implemented as a convolutional neural network (CNN) based classifier. The neural network based classifier may refer, for example, to a calculation model realized to simulate the calculation power of a biological system using a large number of artificial neurons connected by connecting lines, and it may perform a human recognition or learning process through the connecting lines having a connection strength (weight). However, the classifier of the disclosure is not limited thereto and may be realized as the various classifiers described above.

A general neural network may include, for example, and without limitation, an input layer, a hidden layer, and an output layer, and the hidden layer may include one or more layers as necessary. As an algorithm for learning the neural network, a back propagation algorithm may be used.

When any data is input to the input layer of the neural network, the classifier can train the neural network such that output data for the input learning data is output to the output layer of the neural network. When feature information extracted from the input data is input, a pattern of the feature information may be classified into one of several classes and a classification result may be output using the neural network.
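
By way of illustration only, and not as any claimed embodiment, such a classification step might be sketched as the following simplified forward pass; the layer sizes, weight values, and class labels here are hypothetical.

```python
import numpy as np

# Minimal sketch of a feedforward classifier: feature information is
# passed through one hidden layer, and the output layer produces a
# score per class. All shapes and values here are hypothetical.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(8)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)    # hidden layer -> output layer
CLASSES = ["gesture", "emotion", "gender", "age"]  # hypothetical class labels

def classify(features: np.ndarray) -> str:
    """Classify a 16-dimensional feature vector into one of several classes."""
    hidden = np.tanh(features @ W1 + b1)           # hidden-layer activation
    scores = hidden @ W2 + b2                      # raw class scores
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over classes
    return CLASSES[int(np.argmax(probs))]

print(classify(rng.normal(size=16)))
```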

The processor may include a classification system based on a series of machine learning algorithms based on the neural network, and can use the deep learning based recognition system.

FIG. 1 is a block diagram illustrating an example electronic apparatus according to an embodiment of the disclosure.

The electronic apparatus 100 may include a memory 110 and a processor (e.g., including processing circuitry) 120.

The electronic apparatus 100 may include, for example, and without limitation, a TV, a desktop PC, a laptop, a smartphone, a tablet PC, a server, or the like. The electronic apparatus 100 may, for example, and without limitation, be implemented as a system itself in which a cloud computing environment is built, that is, a cloud server. Specifically, the electronic apparatus 100 may be a device including a deep learning based recognition system. The above-described examples are merely examples for describing the electronic apparatus, and the electronic apparatus is not necessarily limited to the above-described devices.

The memory 110 may be implemented as an internal memory such as, for example, and without limitation, a ROM (e.g., an electrically erasable programmable read-only memory (EEPROM)) or a RAM included in the processor 120, or may be implemented as a memory separate from the processor 120.

The memory 110 may, for example, store a plurality of input data input sequentially by a plurality of types. An example operation for storing by the plurality of types will be described later in the description of the operation of the processor 120.

A memory embedded in the electronic apparatus 100 may be implemented, for example, and without limitation, as at least one of a volatile memory (e.g., dynamic RAM (DRAM), static RAM (SRAM), synchronous dynamic RAM (SDRAM), etc.), a non-volatile memory (e.g., one-time programmable ROM (OTPROM), programmable ROM (PROM), erasable and programmable ROM (EPROM), electrically erasable and programmable ROM (EEPROM), mask ROM, flash ROM, flash memory (e.g., NAND flash or NOR flash, etc.)), a hard disk drive (HDD), or a solid state drive (SSD). A memory detachable from the electronic apparatus 100 may be implemented, for example, and without limitation, as a memory card (for example, a compact flash (CF), a secure digital (SD), a micro secure digital (Micro-SD), a mini secure digital (Mini-SD), an extreme digital (xD), a multi-media card (MMC), etc.), an external memory (e.g., USB memory) connectable to a USB port, or the like.

The processor 120 may include various processing circuitry and perform an overall operation for controlling the electronic apparatus. For example, the processor may function to control the overall operation of the electronic apparatus.

The processor 120 may, for example, and without limitation, be implemented as a digital signal processor (DSP), a microprocessor, a time controller (TCON), or the like, but is not limited thereto. The processor may include, for example, and without limitation, one or more among a central processing unit (CPU), a micro controller unit (MCU), a micro processing unit (MPU), a controller, an application processor (AP), a communication processor (CP), an ARM processor, or the like, and may be defined by the corresponding terms. In addition, the processor may be implemented as a system on chip (SoC) or a large scale integration (LSI) in which processing algorithms are embedded, or a field programmable gate array (FPGA).

Functions related to artificial intelligence (hereinafter referred to as AI) according to the disclosure may be operated through the processor and the memory. The processor may include one or more processors. The one or more processors may include, for example, and without limitation, a general purpose processor such as a CPU, an AP, a digital signal processor (DSP), or the like, a graphics dedicated processor such as a GPU or a vision processing unit (VPU), an AI dedicated processor such as an NPU, or the like. The one or more processors may include various processing circuitry and control the electronic apparatus to process input data according to a predefined operating rule or AI model stored in the memory. When the one or more processors include an AI dedicated processor, the AI dedicated processor may be designed with a hardware structure specialized for processing a specific AI model.

The predefined operating rule or AI model is characterized by being made through learning. Being made through learning may refer, for example, to a basic AI model being trained using a plurality of learning data by a learning algorithm, thereby creating a predefined operating rule or AI model set to perform a desired feature (or purpose). Such learning may be made in a device itself in which AI according to the disclosure is performed, or may be made through a separate server and/or system. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but are not limited to the above examples.

The AI model may include, for example, a plurality of neural network layers. Each of the plurality of neural network layers may have a plurality of weight values, and may perform a neural network calculation through a calculation between a calculation result of a previous layer and the plurality of weights. The plurality of weight values that the plurality of neural network layers have may be optimized by learning results of an AI model. For example, the plurality of weights may be updated to reduce or minimize a loss value or a cost value obtained from the AI model during the learning process. The AI neural networks may include, for example, and without limitation, a deep neural network (DNN) such as a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, and the like, but are not limited to the above examples.
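
Purely as an illustrative sketch of the weight update described above (not the claimed training procedure), a single gradient descent step on one layer might look as follows; the layer size, data, and learning rate are hypothetical.

```python
import numpy as np

# Illustrative single training step: weights are updated in the
# direction that reduces a loss value, as described above.
# The layer size, data, and learning rate are hypothetical.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 1))          # weights of one neural network layer
x = rng.normal(size=(32, 8))         # calculation result of a previous layer
y = rng.normal(size=(32, 1))         # target values for the loss calculation

pred = x @ W                          # neural network calculation
loss = np.mean((pred - y) ** 2)       # loss value obtained from the model
grad = 2 * x.T @ (pred - y) / len(x)  # gradient of the loss w.r.t. the weights
W -= 0.01 * grad                      # update weights to reduce the loss
```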

The processor 120 may divide a plurality of input data sequentially input into a plurality of types and store the plurality of types in the memory 110. When a user voice command is recognized among the input data, the processor 120 may determine at least one of the input data of the plurality of types classified based on information related to the user voice command. The processor 120 may provide response information corresponding to the user voice command based on the input data of the determined type.
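
For illustration only, the overall flow just described — classify incoming data by type, select one or more types when a voice command is recognized, and respond using only the selected types — might be sketched as follows; the type names and helper rules are hypothetical stand-ins, not the claimed implementation.

```python
from collections import defaultdict

# Hypothetical sketch of the flow described above: input data is
# classified by type and stored, and when a voice command is recognized,
# only the selected types are used to build the response.
store = defaultdict(list)   # plays the role of the memory 110

def classify_type(sample):
    # Stand-in classifier: the sample's source field decides its type.
    return sample["type"]

def select_types(command):
    # Stand-in for the AI learning model's type selection: a wake-up
    # word selects only the voice-related types.
    return ["text", "emotional (voice)"] if "Bixby" in command else list(store)

def on_input(sample):
    store[classify_type(sample)].append(sample["value"])
    command = sample.get("command")   # set when a voice command is recognized
    if command:
        selected = select_types(command)
        return {t: store[t] for t in selected}   # basis for the response
    return None

on_input({"type": "gesture", "value": "left"})
print(on_input({"type": "text",
                "value": "Bixby, turn on the air conditioner over there",
                "command": "Bixby, turn on the air conditioner over there"}))
```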

According to another embodiment, in addition to the voice command, e.g., a user voice command, the user command may be input using a motion or a gesture. For example, the processor 120 may receive the user command through an image in addition to the user voice command. In addition, the user voice command described herein may be replaced with a user image command.

The input data may be image data or voice data capable of being used to analyze a user's behavior. The voice data may be obtained through a microphone of the electronic apparatus 100, or may be received through a microphone of an external device according to an embodiment, in which case the electronic apparatus 100 may receive only the voice data. In addition, the image data may be obtained from a camera of the electronic apparatus 100, or the like, or may be received through a camera of the external device according to an embodiment, in which case the electronic apparatus 100 may receive only the image data.

The plurality of types may include at least one of gesture information, emotion information, face recognition information, gender information, age information or voice information.

A preset user voice command may be a user's behavior capable of being recognized from the voice data and the image data. The user's behavior may refer, for example, to the user speaking a specific word or taking/making a specific motion. The preset user voice command may be changed according to the user's setting.

The processor 120 may analyze (decide) the user voice command using input data stored in a plurality of types. The processor 120 may analyze the user voice command by selecting (determining) only some types of the plurality of types. The processor 120 may use information related to the user voice command to select some types among the plurality of types. The information related to the user voice command may refer to any and all behaviors or actions of the user that can be recognized from voice data and image data. The processor 120 may recognize the user's behavior or action using voice data and image data, and obtain a recognized result as the user's information. In addition, the processor 120 may classify the obtained user information and input data into a plurality of types, and store them. The processor 120 may analyze the user voice command using some types among the plurality of types.

The processor 120 may use the AI learning model to select some types of the plurality of types. In more detail, in case of a specific event, the AI learning model may be used in a process of calculating an analysis recognition rate by selecting any type. The processor 120 may control the electronic apparatus to determine a type for which the highest recognition rate is expected using the AI learning model.

The AI learning model may compare input data corresponding to the user voice command to response information corresponding to the actual user voice command in order to analyze the user voice command. If the input data is compared with the response information corresponding to the user voice command, it may be possible to analyze based on which type, among the plurality of types described above, the user voice command has the highest recognition rate.

When analyzing a type based on pre-stored input data classified into a plurality of types, the AI learning model may learn which type has the highest recognition rate and obtain a criterion by determining the type having the highest recognition rate. In addition, when new input data is received, a recognition rate may be obtained by applying the new input data to the previous criterion. The AI learning model may determine whether the criterion still has a high recognition rate as before, even when the criterion to which the new input data is applied is changed.

If the recognition rate of the new input data is higher than that of the existing criterion, the AI learning model may change the existing criterion. In addition, whenever new input data is received, the described process may be repeated. According to another embodiment, after determining a final criterion, an operation of comparing recognition rates may be performed every time a preset number of input data are received.
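
A loose, purely illustrative sketch of this criterion update — under the assumption that each candidate type selection carries an observed recognition rate — might look as follows; the candidate sets, the scoring function, and all values are hypothetical.

```python
# Hypothetical sketch: keep the type selection (the "criterion") that has
# shown the highest recognition rate, and revise it when new input data
# suggests a better one. The scoring function is a stand-in.
criterion = {"types": ("text",), "rate": 0.0}

def recognition_rate(types, sample):
    # Stand-in for evaluating how well these types explain the command.
    return sample["rates"].get(types, 0.0)

def update_criterion(sample):
    global criterion
    candidates = [("text",), ("text", "gesture"), ("text", "emotion_voice")]
    best = max(candidates, key=lambda t: recognition_rate(t, sample))
    best_rate = recognition_rate(best, sample)
    if best_rate > criterion["rate"]:    # new data beats the old criterion
        criterion = {"types": best, "rate": best_rate}

update_criterion({"rates": {("text", "gesture"): 0.9}})
print(criterion)
```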

The processor 120 may determine at least one among the input data of a plurality of types classified based on time information related to the user voice command.

The processor 120 may classify audio data and image data according to time information. For example, if a user utters “Bixby, turn on the air conditioner over there”, the processor 120 may obtain, from the voice data, information corresponding to “Bixby”, “over there”, “air conditioner”, and “turn on” in accordance with time. In addition, the image data corresponding to the time may be obtained. The processor 120 may classify input data by matching the voice data and image data corresponding to the time information or the time interval. In addition, the user voice command may be analyzed by classifying the voice data and image data.
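
The time-based matching described above might, purely as an illustration, look like the following sketch, where each recognized word and each image-derived observation carries a timestamp; all of the data and the matching window are hypothetical.

```python
# Hypothetical sketch: match voice-derived and image-derived data that
# fall into the same time interval (timestamps in seconds).
voice = [(0.5, "Bixby"), (0.8, "over there"), (1.0, "air conditioner"), (1.2, "turn on")]
image = [(0.8, "gesture: left"), (1.0, "gesture: left"), (1.2, "gesture: left")]

def match_by_time(voice_data, image_data, window=0.1):
    """Pair voice and image observations whose timestamps are within `window`."""
    pairs = []
    for t_v, word in voice_data:
        hits = [obs for t_i, obs in image_data if abs(t_i - t_v) <= window]
        pairs.append((t_v, word, hits))
    return pairs

for t, word, hits in match_by_time(voice, image):
    print(f"{t:.1f}s: {word!r} matched with {hits}")
```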

The processor 120 may analyze user information corresponding to a specific time interval based on the time information. The corresponding operation will be described in greater detail below with reference to FIG. 5.

The processor 120 may group input data classified into a plurality of types by a predetermined time unit, obtain representative data for each type corresponding to each time unit based on the grouped input data, store the representative data for each type in the memory 110, and provide response information corresponding to the user voice command based on the representative data of the determined type.

The grouping operation may refer, for example, to an operation of arranging a plurality of data and converting the data into one data. In addition, the grouping operation may refer, for example, to an operation of converting a plurality of data into one representative value (or representative data). For example, assuming that 10 input data are received from 1 second to 10 seconds, the input data received for 10 seconds may be grouped and converted into one input data. The one input data may be a representative value (or representative data).

The representative data may be one or more. For example, assuming that 20 input data are received, one every second, two groups and two representative values may exist when the grouping operation is performed in units of 10 seconds. In addition, the processor 120 may analyze the user voice command using the two representative values. Analyzing the user voice command may refer, for example, to the processor 120 obtaining response information corresponding to the user voice command.

An operation of obtaining a representative value may use any one of a maximum value, a minimum value and an average value among a plurality of input data according to time order. In addition, the representative value may be obtained by combining a plurality of data. For example, assuming that there are four pieces of text information, “Bixby”, “over there”, “air conditioner” and “turn on”, the processor 120 may convert them into one piece of text information, “Bixby, turn on the air conditioner over there”, through the grouping operation.
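
As a simple illustrative sketch (not the claimed implementation), grouping timestamped input data into fixed time units and reducing each group to a representative value might look as follows; the 0.2-second unit, the averaging/concatenation rules, and the data are hypothetical.

```python
# Hypothetical sketch: group timestamped values into fixed time units and
# reduce each group to one representative value (average for numbers,
# concatenation for text), as described above.
def group(samples, unit=0.2):
    """samples: list of (timestamp, value) -> {bucket_index: [values]}."""
    buckets = {}
    for t, value in samples:
        buckets.setdefault(int(t / unit), []).append(value)
    return buckets

def representative(values):
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)     # use the average as representative
    return "".join(str(v) for v in values)   # e.g., "Bix" + "by" -> "Bixby"

emotion = [(0.5, 0.6), (0.6, 0.8), (0.7, 0.7)]   # hypothetical emotion scores
text = [(0.5, "Bix"), (0.6, "by")]               # hypothetical text fragments
print({k: representative(v) for k, v in group(emotion).items()})
print({k: representative(v) for k, v in group(text).items()})
```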

The processor 120 may, for example, analyze the user's voice and perform the grouping operation on the interval where the user actually speaks. In general, the voice data may include both an interval where the user uttered and an interval where the user did not utter. The processor 120 may distinguish an interval where the wake-up word is recognized, or an interval thereafter, in order to shorten a processing time. The processor 120 may perform the grouping operation on the distinguished specific intervals.

Meanwhile, the grouping operation will be described in greater detail below with reference to FIGS. 7 and 8.

The processor 120 may compare the amount of change in the representative data for each type corresponding to each time unit, assign the largest weight to a type having the largest amount of change, and provide response information corresponding to the user voice command. The operation of assigning weights will be described in greater detail below with reference to FIG. 13.
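
Purely as an illustration of this weighting idea, one might weight each type in proportion to how much its representative data changed between time units, so that the type with the largest change receives the largest weight; the data and the proportional rule below are hypothetical.

```python
# Hypothetical sketch: weight each type by how much its representative
# data changes across time units; the type that changes most gets the
# largest weight.
reps = {
    "gesture": [0.0, 0.0, 1.0, 1.0],   # e.g., a hand raised mid-utterance
    "emotion": [0.6, 0.6, 0.5, 0.6],
    "text":    [0.2, 0.2, 0.2, 0.2],
}

def change(values):
    """Total amount of change across consecutive time units."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

changes = {t: change(v) for t, v in reps.items()}
total = sum(changes.values()) or 1.0
weights = {t: c / total for t, c in changes.items()}  # largest change -> largest weight
print(weights)
```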

The processor 120 may recognize the user voice command based on at least one of gesture information or voice information among the input data.

The user voice command may be analyzed using voice data and image data. The voice and image data may include information for analyzing a user's command. For example, an object to be controlled and a control command may be included, and the user may make a specific motion corresponding to the object to be controlled and the control command.

In addition, when the user voice command is recognized in the input data, the processor 120 may recognize the user voice command as a preset voice recognition unit, and determine at least one among the plurality of input data classified by the plurality of types based on a time interval belonging to at least one voice recognition unit.

When the wake-up word is included in the user voice command, the processor 120 may determine a time interval where the wake-up word is included and an uttered interval thereafter, and analyze the user voice command using the input data included in the determined intervals.

When the wake-up word is included in the user voice command, the processor 120 may determine the user voice command as the preset voice recognition unit based on the time interval where the wake-up word is recognized.

An end point detection (EPD) operation may be performed to determine an interval where the user utters or a preset voice recognition unit, and a detailed operation will be described in greater detail below with reference to FIG. 5.

According to another embodiment, the preset voice recognition unit may be a preset time interval. For example, when the time interval is set as 0.2 seconds, the processor 120 may analyze the voice data including the wake-up word as the preset voice recognition unit (0.2 seconds).

When response information cannot be provided based on the input data input during the preset time interval after the wake-up word is recognized, the processor 120 may use the input data input in a previous time interval before the wake-up word is recognized and provide the response information corresponding to the user voice command.

When the processor 120 does not recognize the input data corresponding to a specific time interval, the processor 120 may analyze the user voice command using the input data corresponding to the previous time interval. An operation of using input data corresponding to a previous time interval will be described in greater detail below with reference to FIG. 12.
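
For illustration, the fallback just described — answer from data captured after the wake-up word if possible, and otherwise also draw on data from the interval before the wake-up word — could be sketched like this; the interval layout and the "needs an object and a command" rule are hypothetical.

```python
# Hypothetical sketch: try to answer from input data captured after the
# wake-up word; if that is not enough, fall back to the input data that
# was captured just before the wake-up word was recognized.
def respond(after_wakeup, before_wakeup):
    def build(data):
        # Stand-in: a response needs both an object and a command word.
        words = " ".join(data)
        if "air conditioner" in words and "turn on" in words:
            return f"OK: {words}"
        return None

    return build(after_wakeup) or build(after_wakeup + before_wakeup)

# The utterance after the wake-up word lacks the object to be controlled,
# so data from the previous time interval supplies it.
print(respond(["turn on"], ["air conditioner", "over there"]))
```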

The processor 120 may determine at least one of the input data of the plurality of types classified based on at least one of information on a user's intention or an object to be controlled recognized from the user voice command. When a preset word included in the user's voice is recognized, the processor 120 may analyze the user's command using a type corresponding to the preset word. For example, when the user utters “Bixby”, the processor 120 may select only a type related to voice recognition and analyze the user voice command. Specific operations will be described in greater detail below with reference to FIGS. 9 and 10.

The electronic apparatus 100 according to the disclosure may receive voice data and image data and analyze a user's behavior. The electronic apparatus 100 may store the voice data and image data as input data for each type. In addition, when a specific event occurs, the electronic apparatus 100 may analyze the user voice command using a type corresponding to the specific event. Since a specific type of data is used instead of all the data, the electronic apparatus 100 may shorten a processing time. In addition, since only the type corresponding to the specific event is used, only the data of a required area (type) may be reflected in a result. Accordingly, the electronic apparatus 100 of the disclosure may expect an improvement in a recognition rate for the user voice command.

FIG. 2 is a block diagram illustrating an example configuration of an example electronic apparatus of FIG. 1.

Referring to FIG. 2, the electronic apparatus 100 according to an embodiment may include a memory 110, a processor (e.g., including processing circuitry) 120, a communication interface (e.g., including communication circuitry) 130, a user interface (e.g., including user interface circuitry) 140, and an input/output interface (e.g., including input/output circuitry) 150.

Descriptions of operations of the memory 110 and the processor 120 that are the same as the operations described above may not be repeated here.

The processor 120 may include various processing circuitry and control the overall operations of the electronic apparatus 100 using various programs stored in the memory 110.

For example, the processor 120 may include a random access memory (RAM) 121, a read only memory (ROM) 122, a main central processing unit (CPU) 123, first to nth interfaces 124-1 to 124-n, and a bus 125.

In this example, the RAM 121, the ROM 122, the main CPU 123, the first to nth interfaces 124-1 to 124-n, and so on may be connected with each other through the bus 125.

The ROM 122 stores a set of instructions for system booting. If a turn-on command is input and the power is supplied, the main CPU 123 copies the O/S stored in the memory 110 into the RAM 121 according to the command stored in the ROM 122, and boots the system by executing the O/S. In response to the booting being completed, the main CPU 123 may copy various application programs stored in the memory 110 to the RAM 121, and execute the application programs copied to the RAM 121 to perform various operations.

The main CPU 123 accesses the memory 110 to perform booting using the O/S stored in the memory 110. The main CPU 123 may perform various operations using the various programs, contents, data, and the like stored in the memory 110.

The first to nth interfaces 124-1 to 124-n may be connected with the aforementioned various components. One of the interfaces may be a network interface which is connected to an external apparatus via a network.

The processor 120 may perform a graphics processing function (video processing function). For example, the processor 120 may generate a screen including various objects such as an icon, an image, a text, etc. using a calculator (not shown) and a renderer (not shown). The calculator (not shown) may calculate attribute values, such as coordinate values at which each object will be represented, forms, sizes, and colors according to a layout of the screen, based on the received control instruction. The renderer (not shown) may generate screens of various layouts including the objects based on the attribute values calculated by the calculator (not shown). The processor 120 may also perform various image processing processes such as decoding, scaling, noise filtering, frame rate conversion, and resolution conversion on video data.

The processor 120 may perform processing on audio data. For example, the processor 120 may perform various processes, such as decoding, amplification, and noise filtering, on the audio data.

The communication interface 130 may include various communication circuitry and may be an element that communicates with various external apparatuses according to various types of communication methods. The communication interface 130 may include various communication circuitry included in various communication modules, such as, for example, and without limitation, a Wi-Fi module 131, a Bluetooth module 132, an infrared communication module 133, a wireless communication module 134, and so on. The processor 120 may perform communication with various external apparatuses using the communication interface 130. The external apparatus may include, for example, and without limitation, a display device such as a TV, an image processing device such as a set-top box, an external server, a control device such as a remote controller, an audio output device such as a Bluetooth speaker, a lighting device, a smart cleaner, home appliances such as a smart refrigerator, a server such as an IoT home manager, and the like.

The Wi-Fi module 131 and the Bluetooth module 132 may perform communication using a Wi-Fi method and a Bluetooth method, respectively. In the case of using the Wi-Fi module 131 or the Bluetooth module 132, connection information such as a service set identifier (SSID) and a session key may be received and transmitted first, communication may be connected using the connection information, and then various information may be received and transmitted.

The infrared communication module 133 may perform communication according to infrared data association (IrDA) technology that transmits data wirelessly over a short distance using infrared rays between visible light and millimeter waves.

The wireless communication module 134 may refer, for example, to a module that performs communication according to various communication standards, such as ZigBee, 3rd Generation (3G), 3rd Generation Partnership Project (3GPP), Long Term Evolution (LTE), LTE-A, 4G (4th Generation), 5G (5th Generation), and the like, in addition to the Wi-Fi module 131 and the Bluetooth module 132 described above.

The communication interface 130 may include at least one of a local area network (LAN) module, an Ethernet module, or a wired communication module performing communication using a pair cable, a coaxial cable, an optical fiber cable, or the like.

According to an embodiment, the communication interface 130 may use the same communication module (e.g., a Wi-Fi module) to communicate with an external apparatus such as a remote controller and an external server.

According to another embodiment, the communication interface 130 may use different communication modules to communicate with an external apparatus such as a remote controller and an external server. For example, the communication interface 130 may use at least one of the Ethernet module or the Wi-Fi module to communicate with the external server, and may use a BT module to communicate with an external apparatus such as a remote controller.

However, this is only an example, and the communication interface 130 may use at least one communication module among various communication modules in the case of communicating with a plurality of external apparatuses or external servers.

The communication interface 130 may further include a tuner and a demodulator according to an embodiment.

The tuner (not shown) may receive an RF broadcast signal by tuning to a channel selected by a user, or all channels stored in advance, among radio frequency (RF) broadcast signals received through an antenna.

The demodulator (not shown) may receive and demodulate a digital IF signal (DIF) converted by the tuner and perform channel decoding.

The user interface 140 may include various user interface circuitry and may, for example, and without limitation, be implemented as a device such as a button, a touch pad, a mouse, a keyboard, and the like, or may be implemented as a touch screen that can also perform the above-described display function and operation unit function. The button may be various types of buttons, such as a mechanical button, a touch pad, a wheel, etc. which are formed on any region, such as the front, side, or rear of the main body of the electronic apparatus 100.

The input/output interface 150 may include various input/output circuitry and may, for example, and without limitation, be one or more of a high definition multimedia interface (HDMI), a mobile high-definition link (MHL), a universal serial bus (USB), a display port (DP), a Thunderbolt, a video graphics array (VGA) port, an RGB port, a D-subminiature (D-SUB), a digital visual interface (DVI), or the like.

The HDMI may refer, for example, to an interface capable of transmitting high-performance data for AV devices that input and output audio and video signals. The DP may refer, for example, to an interface capable of realizing 1920×1080 full HD, ultra-high resolution screens such as 2560×1600 or 3840×2160, 3D stereoscopic images, and digital audio. The Thunderbolt may, for example, refer to an input/output interface for high-speed data transmission and connection, and can connect a PC, a display, a storage device, etc. in parallel through one port.

The input/output interface 150 may input/output at least one of the audio and video signals.

According to an embodiment, the input/output interface 150 may include a port for inputting/outputting only an audio signal and a port for inputting/outputting only a video signal as separate ports, or may be implemented as one port for inputting/outputting both the audio signal and the video signal.

The electronic apparatus 100 may be implemented as a device which does not include a display, and may transmit the video signal to a separate display device.

The electronic apparatus 100 may transmit a corresponding voice signal to an external server for voice recognition of the voice signal received from the external device.

For example, the communication module for communicating with the external device and the external server may be the same, e.g., a Wi-Fi module.

Alternatively, the communication modules for communicating with the external device and the external server may be implemented separately. For example, the electronic apparatus 100 may communicate with the external device through a Bluetooth module, and may communicate with the external server through an Ethernet modem or a Wi-Fi module.

The electronic apparatus 100 according to an embodiment may transmit a received digital voice signal to a voice recognition server. In this example, the voice recognition server may convert the digital voice signal into text information using speech-to-text (STT). In this example, the voice recognition server may transmit the text information to another server or the electronic apparatus to perform a search corresponding to the text information. In some cases, the voice recognition server may directly perform the search.

The electronic apparatus 100 according to another embodiment may directly apply the STT function to the digital voice signal, convert the digital voice signal into text information, and transmit the converted text information to an external server.

In addition to the components illustrated in FIG. 2, the electronic apparatus 100 may further include a display and a speaker.

The display may be implemented as various types of displays, such as, for example, and without limitation, a liquid crystal display (LCD), an organic light emitting diodes (OLED) display, a plasma display panel (PDP), or the like. The display may include a driving circuit, a backlight unit, etc. which may be implemented in the form of an a-Si TFT, a low temperature poly silicon (LTPS) TFT, an organic TFT (OTFT), or the like. The display may be implemented as a touch screen combined with a touch sensor, a flexible display, a 3D display, or the like.

In addition, the display according to an embodiment may include not only a display panel for outputting an image but also a bezel housing the display panel. For example, the bezel according to an embodiment may include a touch sensor (not shown) for detecting a user interaction.

The speaker may be a component that outputs not only various audio data processed by the input/output interface 150, but also various notification sounds or voice messages.

The electronic apparatus 100 may further include a microphone (not shown). The microphone may, for example, refer to a configuration for receiving and converting a voice of a user or other sounds into audio data. In this example, the microphone may convert a received analog user voice signal and transmit it to the electronic apparatus 100.

The microphone (not shown) may receive the user voice in an activated state. For example, the microphone may be formed integrally with the upper side, the front side, or the side of the electronic apparatus 100. The microphone may include various configurations, such as a microphone for collecting user voice in analog form, an amplifier circuit for amplifying the collected user voice, an A/D conversion circuit for sampling the amplified user voice and converting it into a digital signal, a filter circuit for removing a noise component from the converted digital signal, or the like.

FIG. 3 is a diagram illustrating an example operation of classifying and storing input data according to types according to an embodiment of the disclosure.

The electronic apparatus 100 may recognize both voice data and image data of users. For example, the electronic apparatus 100 may receive a user's speech through the microphone. In addition, the electronic apparatus 100 may receive an appearance of users through a camera.

For example, when a user utters “Bixby, turn on the air conditioner over there” while pointing his/her hand toward a certain direction, the electronic apparatus 100 may recognize the user's behavior using a microphone and a camera.

The electronic apparatus 100 may receive audio data and video data as input data. The electronic apparatus 100 may store the received input data for each of a plurality of types. For example, the electronic apparatus 100 may classify and store the input data into face ID, gesture, gender, age, emotional (video), voice ID, text, emotional (text), and emotional (voice).

The face ID may refer to a unique ID according to the user's face, and the electronic apparatus 100 may obtain the face ID using a method of determining a unique characteristic of the user, such as an iris or an outline of the face.

The gesture may refer to an operation of pointing in a specific direction or taking a specific motion using the user's hand, finger or arm.

The text may refer to a result of converting voice data spoken by the user into text.

The emotional (video) may refer to the user's emotional state recognized through the video, and the emotional (text) may refer to the user's emotional state analyzed using only the text result. In addition, the emotional (voice) may refer to the user's emotional state analyzed using only the voice data.

The types corresponding to the face ID, the gesture, and the emotional (video) may be a result of analysis using the video data, and the types corresponding to the voice ID, the text, the emotional (text), and the emotional (voice) may be a result of analysis using the voice data.

The types corresponding to gender and age may be a result of analysis using at least one of the voice data or the video data.

Referring to FIG. 3, the electronic apparatus 100 may store the face ID as face-user 1 and the gesture as left. The left may indicate that the direction pointed by the user's finger is left. According to another embodiment, a name of a specific motion may be stored. In addition, the electronic apparatus 100 may store the gender as female, the age as 30, and the emotional (video) as joy 0.6. The joy 0.6 may describe at least one of a value representing an emotional state or a probability corresponding to the emotional state. In addition, the electronic apparatus 100 may store the voice ID as voice-user 1, the text as “Bixby, turn on the air conditioner over there”, the emotional (text) as neutral 0.6, and the emotional (voice) as joy 0.8.
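
For illustration only, the stored record in this example could be represented as a simple per-type structure such as the following; the field names and values merely mirror the hypothetical example above.

```python
from dataclasses import dataclass

# Hypothetical representation of one classified input record from the
# example above; each field corresponds to one of the plurality of types.
@dataclass
class ClassifiedInput:
    face_id: str
    gesture: str
    gender: str
    age: int
    emotion_video: tuple   # (state, value or probability)
    voice_id: str
    text: str
    emotion_text: tuple
    emotion_voice: tuple

record = ClassifiedInput(
    face_id="face-user 1", gesture="left", gender="female", age=30,
    emotion_video=("joy", 0.6), voice_id="voice-user 1",
    text="Bixby, turn on the air conditioner over there",
    emotion_text=("neutral", 0.6), emotion_voice=("joy", 0.8),
)
print(record.gesture, record.emotion_voice)
```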

The electronic apparatus 100 may distinguish between the face ID and the voice ID as described above, but may substantially determine that they correspond to the same user. When the face ID and the voice ID do not match, the electronic apparatus 100 may determine that the analysis is wrong and perform the recognition operation again.

FIGS. 4A, 4B, 4C and 4D are diagrams illustrating an example operation of storing input data by a unit of a predetermined time according to an embodiment of the disclosure.

Referring to FIGS. 4A, 4B, 4C and 4D, the electronic apparatus may be in a state of storing input data in, for example, a time sequence. For example, assuming an embodiment in which a user utters “Bixby, turn on the air conditioner over there”, the input data may be stored based on the time sequence uttered by the user.

The electronic apparatus 100 may recognize that the user utters “Bixby” in the interval illustrated in FIG. 4A, “over there” in the interval illustrated in FIG. 4B, “air conditioner” in the interval illustrated in FIG. 4C, and “turn on” in the interval illustrated in FIG. 4D based on the voice data.

In addition, the electronic apparatus 100 may not recognize a special gesture operation in the interval illustrated in FIG. 4A, and may recognize a gesture operation pointing at the left side in the intervals illustrated in FIGS. 4B, 4C and 4D based on the video data.

In addition, the electronic apparatus 100 may recognize that the user's emotion corresponds to joy in the intervals illustrated in FIGS. 4A and 4B, and neutral in the intervals illustrated in FIGS. 4C and 4D based on the video data. In addition, the electronic apparatus may recognize that the user's emotion corresponds to joy in the interval illustrated in FIG. 4A, may not recognize the user's emotion in the interval illustrated in FIG. 4B, and may recognize that the user's emotion corresponds to joy in the intervals illustrated in FIGS. 4C and 4D based on the voice data.

Referring to FIGS. 4A, 4B, 4C and 4D, the electronic apparatus 100 may receive input data according to a time interval and analyze a type according to each time interval differently. For example, the electronic apparatus may receive the user's action, which changes according to the time interval, as data, and analyze it according to the time interval to store the same.

FIG. 5 is a diagram illustrating an example process of classifying input data based on time information according to an embodiment of the disclosure.

Referring to FIG. 5, the electronic apparatus 100 may distinguish voice uttered by a user based on time information. For example, suppose that the user's voice data is received for two seconds and the user utters “Bixby, turn on the air conditioner over there.”

The voice data corresponds to 2 seconds, but the time interval actually uttered by the user may be less than 2 seconds. Referring to FIG. 5, the time interval actually uttered by the user may be between 0.4 and 1.2 seconds. A time of not uttering voice between “Bixby” and “turn on the air conditioner over there” may be included. The electronic apparatus 100 may determine the time interval actually uttered by the user. The electronic apparatus 100 may analyze the voice data and determine an interval where the amplitude of the sound waveform is continuously greater than an arbitrary value as one interval.

For example, the electronic apparatus may determine the span from a time point at which the amplitude of the sound waveform becomes greater than an arbitrary value to a time point at which it becomes less than the arbitrary value as one interval (t1). In addition, the electronic apparatus 100 may determine the span from a time point at which the amplitude of the sound waveform again becomes greater than the arbitrary value to a time point at which it becomes less than the arbitrary value as a new interval (t2).

As a result, the electronic apparatus 100 may determine the time intervals (t1 and t2) classified by analyzing the voice data.
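
A toy sketch of this amplitude-threshold interval detection might look as follows; the sample rate, threshold, and signal values are hypothetical.

```python
# Hypothetical sketch of the interval detection described above: find
# spans where the amplitude of the sound waveform stays above a threshold.
def detect_intervals(amplitudes, threshold=0.3, dt=0.1):
    """amplitudes: one value per dt seconds -> list of (start, end) spans."""
    intervals, start = [], None
    for i, a in enumerate(amplitudes):
        if a > threshold and start is None:
            start = round(i * dt, 2)                     # utterance begins
        elif a <= threshold and start is not None:
            intervals.append((start, round(i * dt, 2)))  # utterance ends
            start = None
    if start is not None:
        intervals.append((start, round(len(amplitudes) * dt, 2)))
    return intervals

# e.g., "Bixby" (t1), a short pause, then the rest of the command (t2)
signal = [0.0, 0.0, 0.0, 0.0, 0.8, 0.9, 0.1, 0.7, 0.8, 0.9, 0.8, 0.6, 0.0]
print(detect_intervals(signal))  # [(0.4, 0.6), (0.7, 1.2)]
```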

Referring to FIG. 5, when the electronic apparatus recognizes a wake-up word, it may analyze the utterance thereafter. For example, assuming that the electronic apparatus 100 stores Bixby as a wake-up word, when the user utters “Bixby, turn on the air conditioner over there”, the electronic apparatus 100 may recognize Bixby as the wake-up word and perform an end point detection (EPD) operation on the subsequent utterance of the user. That is, the electronic apparatus 100 performs the EPD operation after recognizing the wake-up word, Bixby, so that it may classify the time interval corresponding to t2 separately.

FIG. 6 is a diagram illustrating an example in which contents stored in input data vary with time according to an embodiment of the disclosure.

Referring to FIG. 6, the electronic apparatus 100 may classify and store input data according to a plurality of types in accordance with a time sequence. The electronic apparatus 100 may recognize that the user points to the left with a finger from 0.8 seconds or later, and may store left in the gesture area. The electronic apparatus 100 may recognize that the user's emotional state corresponds to joy between 0.1 and 0.8 seconds based on the image data, and that the user's emotional state corresponds to neutral between 0.9 and 1.2 seconds. According to another embodiment, the electronic apparatus may store a probability corresponding to the emotional state.

The electronic apparatus 100 may store text information according to time based on the voice data. The electronic apparatus 100 may classify and store text information corresponding to “Bixby, turn on the air conditioner over there” in accordance with a time interval. The “Bixby” text information corresponding to 0.5 seconds may actually be text information obtained from voice data corresponding to between 0.5 and 0.6 seconds.

Referring to FIG. 6, the electronic apparatus 100 may classify and store the plurality of types using input data received in accordance with the time sequence.

FIG. 7 is a diagram illustrating an example operation of grouping inputdata into a preset time interval according to an embodiment of thedisclosure.

As for FIG. 7 , it assumes, for convenience of illustration andexplanation, that a user's voice is uttered between 0.5 to 1.3 secondsand a preset time interval is 0.2 seconds.

The electronic apparatus 100 is able to distinguish an interval (0.5 to 1.3 seconds, R1, R2, R3 and R4) in which the user's voice is actually recognized from an interval (0 to 0.5 seconds, R0) in which the user's voice is not recognized. In addition, the electronic apparatus 100 may divide the interval (0.5 to 1.3 seconds) where the user's voice is recognized by the preset time interval (0.2 seconds). The electronic apparatus 100 may group the intervals where the user's voice is recognized at 0.2-second intervals. Specifically, the electronic apparatus 100 may group 0.5 to 0.7 seconds into interval R1, 0.7 to 0.9 seconds into interval R2, 0.9 to 1.1 seconds into interval R3, and 1.1 to 1.3 seconds into interval R4.
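As a rough sketch under the assumptions of FIG. 7 (utterance between 0.5 and 1.3 seconds, 0.2-second bins), the grouping might be implemented as follows; the sample values are invented for the example.

```python
# Illustrative sketch: grouping time-stamped samples recognized as speech
# into preset 0.2-second intervals (R1, R2, ...).
def group_by_interval(samples, start, end, width=0.2):
    """samples: list of (time, value) pairs; returns one list per bin."""
    n_bins = round((end - start) / width)
    bins = [[] for _ in range(n_bins)]
    for t, v in samples:
        if start <= t < end:
            bins[int((t - start) / width)].append((t, v))
    return bins

samples = [(0.5, "Bix"), (0.6, "by"), (0.8, "turn"), (1.0, "on"), (1.2, "there")]
for i, group in enumerate(group_by_interval(samples, start=0.5, end=1.3), 1):
    print(f"R{i}: {group}")
# R1: [(0.5, 'Bix'), (0.6, 'by')]   R2: [(0.8, 'turn')]
# R3: [(1.0, 'on')]                 R4: [(1.2, 'there')]
```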

The electronic apparatus 100 may simplify the input data by performing a grouping operation. When input data are stored for every time point, processing time and storage space can be wasted. Accordingly, the electronic apparatus 100 may simplify the process of storing input data through the grouping operation.

In order to simplify the data storing process through the grouping operation, a process of organizing data is required, and the operation for organizing data will be described in greater detail below with reference to FIG. 8.

FIG. 8 is a diagram illustrating an example process of obtaining a representative value of input data in a process of grouping according to an embodiment of the disclosure.

Data corresponding to R1, R2, R3 and R4 illustrated in FIG. 8 may refer, for example, to data obtained through the grouping process according to FIG. 7. The electronic apparatus 100 may group input data by a preset time interval, and obtain a representative value for each type of information on the grouped time intervals. For example, there are two pieces of input data for each type in the interval corresponding to R1 in FIG. 7. The electronic apparatus 100 may obtain one representative value for each type using the two data. Meanwhile, according to another embodiment, the electronic apparatus 100 may obtain a representative value using two or more data. The number of data used in obtaining the representative value may be adjusted by changing the preset time interval (0.2 seconds).

The electronic apparatus 100 may obtain a representative value using any one of an average value, a minimum value and a maximum value of a plurality of grouped input data. In addition, when only some of the plurality of grouped input data exist, the electronic apparatus 100 may exclude the missing input data and obtain a representative value using only the existing input data.

The electronic apparatus 100 may obtain the representative value by combining the plurality of grouped input data. For example, in the R1 interval of FIG. 7, text information may be stored divided into "Bix" and "by". The electronic apparatus 100 may obtain "Bixby" by combining the text information of "Bix" and "by", and store "Bixby" as a representative value.
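A minimal sketch of obtaining one representative value per grouped interval might look as follows, assuming numeric values are averaged (a minimum or maximum could be used instead) and text fragments are concatenated; both conventions and the sample values are illustrative assumptions.

```python
# Illustrative sketch: one representative value per type for a grouped interval.
def representative(values):
    values = [v for v in values if v is not None]  # exclude missing input data
    if not values:
        return None
    if all(isinstance(v, str) for v in values):
        return "".join(values)        # e.g. "Bix" + "by" -> "Bixby"
    return sum(values) / len(values)  # average of numeric values

print(representative(["Bix", "by"]))     # Bixby
print(representative([0.6, None, 1.0]))  # 0.8 (hypothetical joy probability)
```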

According to another embodiment, various methods may be applied to the electronic apparatus 100, and the electronic apparatus 100 is not limited to the embodiments described above.

FIG. 9 is a diagram illustrating an example operation of selecting some of a plurality of types according to preset data according to an embodiment of the disclosure.

Referring to FIG. 9, if received input data matches preset data, the electronic apparatus 100 may determine a type corresponding to the preset data. For example, assume, for ease of description and illustration, that a user utters "Bixby, turn on the air conditioner over there." When the text information "Bixby" is recognized, the electronic apparatus 100 may provide response information corresponding to a user voice command based on types corresponding to voice ID, text information, emotional (text), and emotional (voice) among the input data. The electronic apparatus 100 may determine that "Bixby" is related to a voice recognition device and provide response information corresponding to the user voice command using only types related to the voice data.

In addition, when the text information "over there" is recognized, the electronic apparatus 100 may provide response information corresponding to the user voice command using a gesture type. The electronic apparatus 100 may determine that the words "over there" refer to a direction or the like, and may provide response information corresponding to the user voice command using the gesture type, which is a type related to direction.

When the text information recognized from the voice data includes at least one of the words "there," "here," "that," "this," "left side," "right side," "east," "west," "south," "north," "up," "down," "left" and "right," the electronic apparatus 100 may provide response information corresponding to the user voice command using the gesture type.
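For illustration, such a preset-word-to-type mapping could be sketched as follows; the word lists and type names are assumptions patterned on FIG. 9, not an exhaustive mapping.

```python
# Illustrative sketch: selecting which types to analyze according to preset
# words recognized in the text information.
DIRECTION_WORDS = {"there", "here", "that", "this", "left", "right",
                   "east", "west", "south", "north", "up", "down"}

def types_for(word):
    if word == "bixby":          # wake-up word: voice-related types
        return {"voice_id", "text", "emotion_text", "emotion_voice"}
    if word in DIRECTION_WORDS:  # words implying a direction: gesture type
        return {"gesture"}
    return {"text"}              # default: text analysis only

for w in "bixby turn on the air conditioner over there".split():
    print(w, "->", types_for(w))
```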

The disclosure is not limited to the embodiments described above; the electronic apparatus 100 may provide response information corresponding to the user voice command using other types in accordance with a word set by a user.

In addition, the electronic apparatus 100 may use an AI learning model to determine a type corresponding to a word set by a user. For example, even if a user does not designate a preset word and a type corresponding to the preset word, the AI learning model may directly match the preset word with the type corresponding to the preset word.

The AI learning model may obtain a criterion of determination based on a plurality of pre-stored input data. The AI learning model may analyze a relation between a specific word and a type based on a large amount of input data. For example, a word corresponding to "Bixby" may not be significantly affected by the gender or age types in obtaining response information corresponding to the user voice command. On the other hand, when a word corresponding to "Bixby" is uttered by a user, the voice ID, text, emotional (text) and emotional (voice) types may have a large influence on the result. The AI learning model may compare the word "Bixby" against all of the plurality of types and determine a criterion for analyzing the user voice command by selecting only types having at least a specific weight.

The artificial intelligence learning model may determine a criterion to select the types having the highest recognition rate by comparing various words with the plurality of types, in addition to the word "Bixby" described above.
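A hedged sketch of the resulting criterion might keep, for each word, only the types whose learned weight meets a threshold; the weight values below are invented placeholders, not learned values from the disclosure.

```python
# Illustrative sketch: per-word influence weights produced by a learning
# model; only types at or above the threshold are used for analysis.
LEARNED_WEIGHTS = {
    "bixby": {"voice_id": 0.8, "text": 0.9, "emotion_text": 0.6,
              "emotion_voice": 0.7, "gender": 0.1, "age": 0.05},
}

def select_types(word, threshold=0.5):
    weights = LEARNED_WEIGHTS.get(word, {})
    return {t for t, w in weights.items() if w >= threshold}

print(select_types("bixby"))
# {'voice_id', 'text', 'emotion_text', 'emotion_voice'} (set order may vary)
```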

Since the electronic apparatus 100 selectively uses types corresponding to the preset text information, it may improve data processing speed. For example, when analyzing data in a conventional manner, all types of data should be analyzed. However, when selectively using some types as described with reference to FIG. 9, the data processing speed may be improved. In addition, since a result is obtained by reflecting only the data required by a user, a recognition rate may be increased.

FIGS. 10A, 10B, 10C and 10D are diagrams illustrating examples in which a plurality of types, which are selected in accordance with time intervals, are different according to an embodiment of the disclosure.

FIGS. 10A, 10B, 10C and 10D are diagrams illustrating an embodiment according to FIG. 9. For example, the electronic apparatus 100 may receive input data corresponding to the user's utterance in the order of FIGS. 10A, 10B, 10C and 10D.

The input data for each time step disclosed in FIGS. 10A-10D may be partially the same as the embodiment disclosed in FIG. 4. Referring to FIGS. 10A-10D, the electronic apparatus 100 may determine some of the plurality of types differently for each time interval according to text information included in the user's utterance and the time intervals. For example, if a user utters "Bixby" in the interval illustrated in FIG. 10A, the electronic apparatus 100 may determine types corresponding to voice ID, text, emotional (text), and emotional (voice) in the intervals illustrated in FIGS. 10A, 10B, 10C and 10D, respectively.

The electronic apparatus 100 may recognize text information corresponding to "over there" in the interval illustrated in FIG. 10B. In this example, the electronic apparatus 100 may determine a gesture type in the intervals illustrated in FIGS. 10B, 10C and 10D and provide response information corresponding to the user voice command.

Referring to FIGS. 10A-10D, the electronic apparatus 100 may provide response information corresponding to the user voice command by selecting a different type according to the time interval. When analyzing all data corresponding to the gesture type regardless of the time interval, the processing time may be long. Since the electronic apparatus 100 selectively uses only input data corresponding to a specific time interval, it may shorten the processing time.

FIG. 11 is a diagram illustrating various examples of providing information corresponding to a user voice command using voice or emotion data.

Hereinafter, an operation of determining which type is used to provide response information corresponding to a user voice command according to received input data will be described.

Example 1 assumes a case in which a user utters "Bixby, turn on the air conditioner over there." A wake-up word may correspond to "Bixby." When "Bixby" uttered by the user is recognized, the electronic apparatus 100 may determine a text type. When the user utters "the air conditioner over there," the electronic apparatus 100 may determine a text type and a gesture type. In addition, when the user utters "turn on," the electronic apparatus may determine a voice ID type.

Example 2 assumes a case in which the user utters "Bixby, buy this book." A wake-up word may correspond to "Bixby." When the user utters "this book," the electronic apparatus 100 may determine a text type and a gesture type, and when the user utters "buy," it may determine at least one of a voice ID or a face ID type.

Example 3 assumes a case in which the user utters "Bixby, play ballad music." A wake-up word may correspond to "Bixby." When the user utters "ballad music," the electronic apparatus 100 may determine a text type. When the user utters "play," the electronic apparatus 100 may determine a voice ID type. When the electronic apparatus 100 determines that the user's emotion is depressed in at least one of voice data or image data, the electronic apparatus may determine emotional (image), emotional (text) and emotional (voice) types.

Example 4 assumes a case in which the user utters "Bixby, register a plan for travel on December 2." A wake-up word may correspond to "Bixby." When the user utters "December 2" and "travel," the electronic apparatus 100 may determine a text type. When the user utters "register a plan," the electronic apparatus 100 may determine at least one of a voice ID or a face ID type. In addition, when the electronic apparatus 100 determines that the user's emotion is in a pleasant mood, the electronic apparatus 100 may determine emotional (image), emotional (text), and emotional (voice) types.

An example operation of determining the emotional state in Examples 3 and 4 will now be described. The electronic apparatus 100 may use at least one of image data, voice data or text information analyzed from the voice data in order to determine the user's emotional state. When the electronic apparatus 100 recognizes that the user's emotion shows a particular emotional state such as pleasure, sadness or the like in the image data, the voice data or the text information analyzed from the voice data, the electronic apparatus 100 may determine all of the emotional (image), emotional (text) and emotional (voice) types.

In FIG. 11, the electronic apparatus 100 determining a particular type may refer, for example, to providing response information corresponding to the user voice command using the particular type.

The disclosure is not limited to the examples above, and the electronic apparatus 100 may select or determine in various ways in accordance with a user's setting.

Referring to FIG. 11, the electronic apparatus 100 may, for example, use only a type corresponding to preset data and provide response information corresponding to the user voice command.

FIG. 12 is a diagram illustrating an example operation of storing input data temporarily and providing response information corresponding to a user voice command using the temporarily stored input data according to an embodiment of the disclosure.

The electronic apparatus 100 may temporarily store input data classified into a plurality of types for a predetermined time. In addition, when an additional command of the user is input, the electronic apparatus 100 may provide response information corresponding to the user voice command using the temporarily stored input data.

For example, when a user utters "set temperature to 24 degrees" without pointing in a specific direction, after having performed an operation of pointing in a specific direction while uttering "Bixby, turn on the air conditioner over there," the electronic apparatus 100 may not obtain information corresponding to the user voice command "set the temperature to 24 degrees." This is because the user voice command "set temperature to 24 degrees" does not have a target to be controlled. Accordingly, when information corresponding to the user voice command is not obtained, the electronic apparatus 100 may provide response information corresponding to the user voice command using a temporarily stored previous analysis result.

In the aforementioned embodiment, since the target is not recognized, information corresponding to the gesture type or text type from which the target can be recognized may be used. The text type obtained in the previous time interval includes the text information of "Bixby, turn on the air conditioner over there," so that the electronic apparatus may determine that the air conditioner is the target to be controlled.
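For illustration, the fallback to temporarily stored input data might be sketched as follows; the cache structure and the 60-second retention window are assumptions, not values from the disclosure.

```python
# Illustrative sketch: when the current command lacks a control target,
# fall back to a temporarily stored previous analysis result.
import time

class ContextCache:
    def __init__(self, ttl=60.0):
        self.ttl = ttl     # assumed retention window in seconds
        self.entries = []  # (timestamp, analysis) pairs

    def store(self, analysis):
        self.entries.append((time.time(), analysis))

    def latest_target(self):
        now = time.time()
        for ts, analysis in reversed(self.entries):  # newest first
            if now - ts <= self.ttl and analysis.get("target"):
                return analysis["target"]
        return None

cache = ContextCache()
cache.store({"text": "Bixby, turn on the air conditioner over there",
             "gesture": "left", "target": "air conditioner"})

command = {"text": "set temperature to 24 degrees", "target": None}
target = command["target"] or cache.latest_target()
print(target)  # air conditioner
```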

According to another embodiment, the electronic apparatus 100 may not immediately use the temporarily stored data. When type information of at least one of the face ID or voice ID matches, the electronic apparatus 100 may use the temporarily stored data.

Referring to FIG. 12, the electronic apparatus 100 may increase a recognition rate for the user voice command using input data corresponding to the previous time interval, thereby improving convenience from the user's point of view.

FIG. 13 is a diagram illustrating an example of providing response information corresponding to a user voice command by assigning weights to a plurality of input data according to an embodiment of the disclosure.

The electronic apparatus 100 may select some types among the plurality of types and provide response information corresponding to the user voice command. The response information corresponding to the user voice command may be obtained by assigning different weights according to the selected types.

Referring to FIG. 11, an operation of analyzing an emotional state is described in Example 3. In order to analyze the emotional state, the electronic apparatus 100 may use the emotional (image), emotional (text) and emotional (voice) types. When a user's emotional state corresponds to sadness in any one of the emotional (image), emotional (text) and emotional (voice) types, the electronic apparatus 100 may determine a final emotional state using all of the emotional (image), emotional (text), and emotional (voice) types. For example, although the sadness state is determined in a specific time interval, whether the user's final emotional state corresponds to sadness may be separately checked.

In order to separately check whether the user's final emotional state corresponds to sadness, the electronic apparatus 100 may use all of the emotional (image), emotional (text), and emotional (voice) types. The electronic apparatus 100 may assign different weights to each type. In addition, the electronic apparatus 100 may determine the weights in consideration of the amount of change in data. In detail, the electronic apparatus 100 may assign a larger weight to a type having a larger amount of change in data.

Referring to FIG. 13, data of the emotional (image) and emotional (text) types are constant, showing sad and neutral respectively, but the emotional (voice) type has changed from neutral to sad. The electronic apparatus 100 may analyze the final emotional state of the user by applying a large weight to the emotional (voice) type having the largest change in data.
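A minimal sketch of this change-based weighting might count label transitions per type and weight the most-changed type most heavily; counting transitions, and the weight values 0.8 and 0.1, are assumptions for the example.

```python
# Illustrative sketch: assign the largest weight to the emotion type whose
# data changed the most across the grouped time intervals.
def change_amount(labels):
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)

series = {
    "emotion_image": ["sad", "sad", "sad", "sad"],
    "emotion_text":  ["neutral", "neutral", "neutral", "neutral"],
    "emotion_voice": ["neutral", "neutral", "sad", "sad"],
}

changes = {t: change_amount(v) for t, v in series.items()}
most_changed = max(changes, key=changes.get)
weights = {t: (0.8 if t == most_changed else 0.1) for t in series}
print(most_changed, weights)
# emotion_voice {'emotion_image': 0.1, 'emotion_text': 0.1, 'emotion_voice': 0.8}
```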

The AI learning model may determine a criterion for assigning weights by learning on its own. The AI learning model may analyze a large amount of stored input data and response information corresponding to the input data, and calculate a recognition rate by applying various weight values to the type having the largest amount of change in data. In addition, the AI learning model may determine the weight value having the highest recognition rate among the various weight values.

The electronic apparatus 100 may determine whether to apply a different weight value for each type. For example, if the type having the largest amount of change in data is emotional (voice), a weight of 0.9 may be applied, but if the type having the largest amount of change in data is gesture, a weight of 0.5 may be applied. The AI learning model may apply various methods regarding these operations based on the recognition rate, thereby determining whether to apply the same weight for each type or to apply different weights for each type.

When applying different weights for each type, the electronic apparatus 100 may increase the recognition rate for the user voice command. A part where the user's emotion is reflected may be different for each user. The electronic apparatus 100 may determine a part where emotions are well expressed and reflect it in a recognition operation by setting different weights according to data changes.

FIG. 14 is a diagram illustrating an example operation of an electronic apparatus for each function according to an embodiment of the disclosure.

The electronic apparatus 100 may provide response information corresponding to a user voice command using, for example, a robot operating system (ROS) framework, an interaction manager, and an application.

The electronic apparatus 100 may receive voice or image data from the robot operating system (ROS) framework. In addition, the electronic apparatus 100 may recognize the user's information using the received data. For example, the electronic apparatus 100 may include various engines for recognizing a speaker, body gestures, finger pointing, emotion (image, text and voice), and the face, gender and age of a user, as well as for automatic speech recognition (ASR). The electronic apparatus 100 may selectively use the data necessary for determining the various user information described above.

The interaction manager may include engine connector, service presenter, semantic analyzer, context manager, and dialog system modules.

The engine connector module may include various processing circuitry and/or executable program elements and perform engine connection capable of obtaining various user information. In addition, the engine connector module may transmit a specific command to obtain user information from a specific engine. The engine connector module may transmit the obtained user information to the semantic analyzer.

The service presenter module may include various processing circuitry and/or executable program elements and communicate with an application. In detail, the service presenter module may receive a final analysis result from the semantic analyzer and transmit it to the application.

The semantic analyzer module may include various processing circuitry and/or executable program elements, receive data corresponding to user information from the various engines, and analyze the data corresponding to the received user information to perform a final analysis operation. The semantic analyzer module may control a specific command to be transmitted to an engine through the engine connector module in order to perform the final analysis operation, and then transmit a result received from the dialog system module to the service presenter.

The context manager module may include various processing circuitry and/or executable program elements and store and manage data generated by the interaction manager. The context manager module may exchange information between different interaction managers. In addition, the context manager may store and manage user information.

The dialog system module may include various processing circuitry and/or executable program elements and communicate with an external server including a conversation function. The dialog system module may transmit an output of internal operations to the external server. In addition, the dialog system module may receive an output from the external server and transmit it to the semantic analyzer. The dialog system module may transmit information on a request for executing a specific task and a result thereof to the context manager module. In addition, the dialog system module may perform operations related to natural language understanding (NLU), dialog management (DM), and natural language generation (NLG) functions, each of which may include various processing circuitry and/or executable program elements.
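For illustration, the wiring among these modules might be sketched as follows; the class and method names are hypothetical stand-ins, not the actual interfaces of the disclosure.

```python
# Illustrative sketch: the engine connector feeds recognition results to the
# semantic analyzer, which consults the dialog system and hands the final
# result to the service presenter for delivery to the application.
class EngineConnector:
    def fetch(self):
        return {"text": "Bixby, turn on the air conditioner over there",
                "gesture": "left"}

class DialogSystem:
    def analyze(self, user_info):
        return {"intent": "device_on", "target": "air conditioner"}

class ServicePresenter:
    def present(self, result):
        print("to application:", result)

class SemanticAnalyzer:
    def __init__(self, connector, dialog, presenter):
        self.connector, self.dialog, self.presenter = connector, dialog, presenter

    def run(self):
        user_info = self.connector.fetch()       # engine results in
        result = self.dialog.analyze(user_info)  # dialog analysis
        self.presenter.present(result)           # final result out

SemanticAnalyzer(EngineConnector(), DialogSystem(), ServicePresenter()).run()
```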

The application may include at least one of a messenger control module, an avatar control module, and a fashion recommendation module.

The robot operating system (ROS) framework, the interaction manager, and the application may be connected through a robot operating system (ROS) interface and a representational state transfer (REST) interface.

The ROS framework may be connected to the interaction manager and the application through the ROS interface, and may be connected to the various engines described above through the ROS interface. The fashion recommendation module (a module not included in the application) may be connected to the interaction manager through the REST interface. In addition, a text-to-speech (TTS) or speech-to-text (STT) module may be connected to the application through the REST interface.

FIG. 15 is a diagram illustrating an example operation of the electronic apparatus according to an embodiment of the disclosure.

Referring to FIG. 15, the ROS framework may transmit and receive an ROS message with the engine connector (S1). The engine connector module may transmit recognition analysis data to the service presenter module (S2). The semantic analyzer module may transmit an automatic speech recognition result to the service presenter module (S3-1), and the service presenter module may transmit the automatic speech recognition result to the smart mirror web application (S3-2). The semantic analyzer module may transmit face ID or voice ID information to the context manager module using the received recognition analysis data (S4). In addition, the context manager module may transmit and receive, with the dialog system module, a request and a response for an external server capable of analyzing dialog (S9-2).

In addition, the semantic analyzer module may transmit a dialog analysis request command for the external server capable of analyzing a conversation to the dialog system module (S5). The dialog system module may transmit the dialog analysis request command to the external server (S6). In addition, the external server may execute weather, reminder and content recommendation operations using a web-hook service (S7), and the external server may transmit the execution result to the dialog system module (S8).

The dialog system module may transmit a fashion recommendation request command to a fashion recommendation engine (S8-1). The fashion recommendation engine may execute the fashion recommendation operation according to the received request command and transmit a result back to the dialog system module (S8-2).

The dialog system module may transmit information on the execution result received from the external server or the fashion recommendation result to the semantic analyzer module (S9-1). The context manager module may transmit the information on the execution result from the external server and the fashion recommendation result obtained by another interaction manager to the dialog system module (S9-2), and the dialog system module may transmit the received information to the semantic analyzer module.

The semantic analyzer module may transmit the received execution result from the external server, the fashion recommendation result, NLG response results, avatar behavior information, and the like to the service presenter module (S10). In addition, the service presenter module may transmit the corresponding information to the smart mirror web application (S11). The smart mirror web application may transmit the NLG result, gender, language, emotion and the like to a module that performs text-to-speech (TTS) or speech-to-text (STT) functions (S12).

The interaction manager may use a hypertext transfer protocol (HTTP) to communicate with the external server or the fashion recommendation engine. In addition, the interaction manager may use the ROS method to communicate with the ROS framework and the smart mirror web application.

Referring to FIG. 15, only some example embodiments have been described, and the disclosure is not limited to the corresponding modules or configurations.

FIG. 16 is a flowchart illustrating an example operation of the electronic apparatus according to an embodiment of the disclosure.

A method of controlling the electronic apparatus according to an embodiment of the disclosure may include dividing a plurality of input data, which are sequentially input, into a plurality of types and storing the plurality of input data in the memory 110 (S1605); determining, if a user voice command is recognized among the input data, at least one of the plurality of classified (e.g., divided) types based on information related to the user voice command (S1610); and providing response information corresponding to the user voice command based on the input data of the determined type (S1615).

The determining at least one of the plurality of classified types (S1610) may include determining at least one of the plurality of classified types based on time information related to the user voice command.

In addition, the storing in the memory 110 (S1605) may include grouping the input data classified into the plurality of types by a predetermined time unit, and obtaining representative data of the plurality of types corresponding to each time unit to store in the memory 110. The providing response information corresponding to the user voice command may include providing response information corresponding to the user voice command based on the representative data of the determined type.

The providing the response information corresponding to the user voice command (S1615) may include comparing the amount of change in the representative data of the plurality of types corresponding to each time unit, and assigning the largest weight to a type having the largest amount of change to provide the response information corresponding to the user voice command.

The plurality of types may include at least one of gesture information, emotion information, face recognition information, gender information, age information or voice information.

In addition, the determining at least one of the input data of the plurality of classified types (S1610) may include recognizing the user voice command based on at least one of the gesture information or the voice information.

In addition, when the user voice command is recognized in the input data, the determining at least one of the input data of the plurality of classified types (S1610) may include recognizing the user voice command as a preset voice recognition unit, and determining at least one among the input data of the plurality of classified types based on a time interval belonging to at least one voice recognition unit.

When a wake-up word is included in the user voice command, the determining at least one of the input data of the plurality of classified types (S1610) may include recognizing the user voice command as the preset voice recognition unit based on the time interval where the wake-up word is recognized.

When the response information cannot be provided based on the input data input during the preset time interval after the wake-up word is recognized, the providing the response information corresponding to the user voice command (S1615) may include providing the response information corresponding to the user voice command using the input data input in the time interval before the wake-up word is recognized.

In addition, the determining at least one of the input data of the plurality of classified types (S1610) may include determining at least one of the input data of the plurality of classified types based on at least one of information on the user intent or a control object recognized from the user voice command.
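A hedged end-to-end sketch of the control method (S1605 through S1615) might look as follows; every helper here is a hypothetical stub standing in for the operations described above.

```python
# Illustrative sketch of FIG. 16: classify and store input data by type
# (S1605), determine types when a voice command is recognized (S1610), and
# provide response information based on the determined types (S1615).
def classify(data):
    return data["type"]  # e.g. "text", "gesture"

def recognize_voice_command(data):
    text = data.get("value", "")
    return text if data["type"] == "text" and "bixby" in text.lower() else None

def determine_types(command):
    return {"text", "gesture"} if "there" in command else {"text"}

def respond(command, typed_data):
    return f"responding to {command!r} using {sorted(typed_data)}"

def control_method(input_stream):
    memory = {}
    for data in input_stream:
        memory.setdefault(classify(data), []).append(data)               # S1605
        command = recognize_voice_command(data)
        if command:
            types = determine_types(command)                             # S1610
            return respond(command, {t: memory.get(t, []) for t in types})  # S1615

stream = [{"type": "gesture", "value": "left"},
          {"type": "text", "value": "Bixby, turn on the air conditioner over there"}]
print(control_method(stream))
```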

The methods according to the above-described embodiments may be realized as an application that may be installed in an existing electronic apparatus.

Further, the methods according to the above-described embodiments may be realized by upgrading the software or hardware, or a combination of the software and hardware, of the existing electronic apparatus.

The above-described example embodiments may be executed through an embedded server in the electronic apparatus or through an external server outside the electronic apparatus.

The method of controlling an electronic apparatus according to the above-described various embodiments may be realized as a program and provided in a user terminal device. For example, a program including a method for controlling an electronic apparatus according to example embodiments may be stored in a non-transitory computer readable medium and provided therein.

Various example embodiments described above may be embodied in a recording medium that may be read by a computer or an apparatus similar to the computer using software, hardware, or a combination thereof. According to an example hardware embodiment, example embodiments that are described in the present disclosure may be embodied using at least one selected from application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and electrical units for performing other functions. In some cases, the embodiments described in the present disclosure may be realized as the processor 120 itself. In an example software configuration, various embodiments described in the disclosure, such as procedures and functions, may be embodied as separate software modules. The software modules may respectively perform one or more functions and operations described in the present disclosure.

Methods of controlling an electronic apparatus according to various example embodiments may be stored on a non-transitory readable medium. Computer instructions stored in the non-transitory readable medium allow a specific apparatus to perform the processing operations in the electronic apparatus according to the above-described various embodiments when executed by the processor of the specific apparatus.

The non-transitory computer readable recording medium may refer, for example, to a medium that stores data and that can be read by devices. For example, the non-transitory computer-readable medium may be a CD, a DVD, a hard disc, a Blu-ray disc, a USB, a memory card, a ROM, or the like.

The foregoing example embodiments and advantages are merely examples and are not to be understood as limiting the present disclosure. The present teachings may be readily applied to other types of apparatuses. The description of the example embodiments of the present disclosure is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.

What is claimed is:
1. An electronic apparatus comprising: a memory; and a processor configured to control the electronic apparatus to: receive a plurality of input data comprising: image data captured via a camera of the electronic apparatus and/or an external device, and voice data captured via a microphone of the electronic apparatus and/or an external device; group the input data classified into a plurality of types based on a neural network; obtain representative data for each of the plurality of types based on the grouped input data to store in the memory; determine at least one type among the classified plurality of types based on a voice command being recognized among the input data; and provide response information corresponding to the voice command based on representative data corresponding to the determined at least one type.
2. The electronic apparatus as claimed in claim 1, wherein the processor is configured to control the electronic apparatus to determine at least one type among the classified plurality of types based on time information related to the voice command.
3. The electronic apparatus as claimed in claim 1, wherein only some of the classified plurality of types are included in the determined type.
4. The electronic apparatus as claimed in claim 1, wherein the processor is configured to control the electronic apparatus to: compare an amount of change in representative data for each of the plurality of types corresponding to each time unit, and assign a largest weight on a type having a largest amount of change to provide response information corresponding to the voice command.
5. The electronic apparatus as claimed in claim 1, wherein the plurality of types comprise at least one of gesture information, emotion information, face recognition information, gender information, age information or voice information.
6. The electronic apparatus as claimed in claim 1, wherein the processor is configured to control the electronic apparatus to recognize the voice command based on at least one of gesture information or voice information among the input data.
7. The electronic apparatus as claimed in claim 1, wherein the processor is configured to control the electronic apparatus to: recognize the voice command as a preset voice recognition unit based on the voice command being recognized in the input data, and determine at least one type among the classified plurality of types based on a time interval of at least one voice recognition unit.
8. The electronic apparatus as claimed in claim 7, wherein the processor is configured to control the electronic apparatus to: recognize the voice command as the preset voice recognition unit based on the time interval where a wake-up word is recognized based on the wake-up word being included in the voice command.
9. The electronic apparatus as claimed in claim 8, wherein the processor is configured to control the electronic apparatus to: provide response information corresponding to the voice command using input data input in a previous time interval before the wake-up word is recognized based on the response information not being provided based on the input data input for a preset time interval after the wake-up word is recognized.
10. The electronic apparatus as claimed in claim 7, wherein the processor is configured to control the electronic apparatus to: determine at least one type among the classified plurality of types based on information on an intention of a user or an object to be controlled which are recognized in the voice command.
11. A computer implemented method of controlling an electronic apparatus comprising: receiving a plurality of input data comprising image data captured via a camera of the electronic apparatus and/or an external device and voice data captured via a microphone of the electronic apparatus and/or an external device; grouping the input data classified into a plurality of types based on a neural network; obtaining representative data for each of the plurality of types based on the grouped input data to store in a memory of the electronic apparatus; determining at least one type among the classified plurality of types based on a voice command being recognized among the input data; and providing response information corresponding to the voice command based on representative data corresponding to the determined at least one type.
12. The method as claimed in claim 11, wherein the determining at least one among input data of the classified plurality of types comprises determining at least one type among the classified plurality of types based on time information related to the voice command.
13. The method as claimed in claim 11, wherein only some of the classified plurality of types are included in the determined type.
14. The method as claimed in claim 11, wherein the providing the response information corresponding to the voice command comprises comparing an amount of change in the representative value for each of the plurality of types corresponding to each time unit, and assigning a largest weight on a type having a largest amount of change to provide response information corresponding to the voice command.
15. The method as claimed in claim 11, wherein the plurality of types comprises at least one of gesture information, emotion information, face recognition information, gender information, age information or voice information.
16. The method as claimed in claim 11, wherein the determining at least one type among the classified plurality of types comprises recognizing the voice command based on at least one of gesture information or voice information among the input data.
17. The method as claimed in claim 11, wherein the determining at least one type among the classified plurality of types comprises, based on a voice command being recognized in the input data, recognizing the voice command as a preset voice recognition unit, and determining at least one type among the classified plurality of types based on a time interval belonging to at least one voice recognition unit.
18. The method as claimed in claim 17, wherein the determining at least one type among the classified plurality of types comprises, based on a wake-up word being included in the voice command, recognizing the voice command as the preset voice recognition unit based on the time interval where the wake-up word is recognized.
19. The method as claimed in claim 18, wherein the providing the response information corresponding to the voice command comprises, based on the response information not being provided based on the input data input for a preset time interval after the wake-up word is recognized, providing the response information corresponding to the voice command using the input data input in a previous time interval before the wake-up word is recognized.
20. The method as claimed in claim 17, wherein the determining at least one type among the classified plurality of types comprises determining at least one type among the classified plurality of types based on at least one of an intention of a user or an object to be controlled which are recognized in the voice command.